# Day 3: Understanding Tokenization and Embeddings
LLMs cannot read text directly. Tokenization is the first step, converting text into sequences of integer IDs; embedding then maps those IDs to meaningful vectors.
## Tokenization Algorithm Comparison
| Algorithm | Used By | Characteristics |
|---|---|---|
| BPE (Byte Pair Encoding) | GPT series | Repeatedly merges the most frequent byte pairs |
| WordPiece | BERT | Similar to BPE but merges based on likelihood |
| SentencePiece | T5, Llama | Language-independent, treats spaces as tokens |
| Unigram | ALBERT, XLNet (via SentencePiece) | Starts with a large vocabulary and prunes low-probability tokens |
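SentencePiece's "treats spaces as tokens" behavior can be made concrete: the library replaces spaces with a visible boundary symbol (U+2581, "▁") before segmenting, so tokenization is fully reversible. Below is a minimal sketch of just that preprocessing step — `mark_spaces` is an illustrative name, not the real SentencePiece API:

```python
def mark_spaces(text: str) -> str:
    """Sketch of SentencePiece-style preprocessing: spaces become an
    explicit '\u2581' boundary symbol, so no information is lost and
    detokenization is a pure string operation."""
    return "\u2581" + text.replace(" ", "\u2581")

marked = mark_spaces("new york city")
print(marked)    # ▁new▁york▁city

# Detokenization is just the inverse replacement:
restored = marked.replace("\u2581", " ").strip()
print(restored)  # new york city
```

Because word boundaries survive as ordinary symbols, the same model works for languages with and without whitespace — the "language-independent" property in the table.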
## Tokenization Practice with tiktoken
```python
# pip install tiktoken
import tiktoken

# Tokenizer used by GPT-4
encoder = tiktoken.encoding_for_model("gpt-4")

text_ko = "대규모 언어 모델은 자연어를 이해합니다"  # "Large language models understand natural language"
text_en = "Large language models understand natural language"

tokens_ko = encoder.encode(text_ko)
tokens_en = encoder.encode(text_en)
print(f"Korean: {len(tokens_ko)} tokens -> {tokens_ko}")
print(f"English: {len(tokens_en)} tokens -> {tokens_en}")

# Decode tokens back to text, one token at a time
for token_id in tokens_ko:
    print(f"  {token_id} -> '{encoder.decode([token_id])}'")
```
Korean requires more tokens than English to express the same meaning. This directly impacts cost and context window utilization.
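To see the cost impact concretely, here is a back-of-the-envelope sketch. The token counts and the per-million-token price below are placeholder assumptions for illustration, not real measurements or real API pricing:

```python
# Hypothetical numbers: assume the same sentence takes 30 tokens in
# Korean and 10 in English, at a placeholder price of $10 per 1M tokens.
PRICE_PER_MILLION = 10.0
tokens_korean, tokens_english = 30, 10

def cost(n_tokens: int, requests: int = 1_000_000) -> float:
    """Dollar cost of `requests` calls of `n_tokens` each."""
    return n_tokens * requests * PRICE_PER_MILLION / 1_000_000

print(f"Korean : ${cost(tokens_korean):,.2f}")    # $300.00
print(f"English: ${cost(tokens_english):,.2f}")   # $100.00
print(f"Ratio  : {tokens_korean / tokens_english:.1f}x")  # 3.0x
```

The same ratio eats into the context window: at 3x the tokens per sentence, roughly a third as much Korean text fits in a fixed-size context.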
## BPE Algorithm Implementation
```python
import re

def simple_bpe(corpus, num_merges):
    """Simplified BPE algorithm."""
    # Initial vocabulary: each word split into characters plus an
    # end-of-word marker, keyed by the space-joined symbol sequence.
    vocab = {}
    for word in corpus:
        key = " ".join(list(word) + ["</w>"])
        vocab[key] = vocab.get(key, 0) + 1

    for i in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = {}
        for word, freq in vocab.items():
            symbols = word.split()
            for j in range(len(symbols) - 1):
                pair = (symbols[j], symbols[j + 1])
                pairs[pair] = pairs.get(pair, 0) + freq
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        print(f"Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}'")

        # Merge the most frequent pair. The lookarounds ensure we only
        # match whole symbols, not substrings of longer symbols
        # (a plain str.replace would merge across symbol boundaries).
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best_pair)) + r"(?!\S)")
        replacement = "".join(best_pair)
        vocab = {pattern.sub(replacement, word): freq for word, freq in vocab.items()}
    return vocab

corpus = ["low", "lower", "newest", "widest", "low", "low"]
simple_bpe(corpus, 5)
```
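Training produces an ordered list of merge rules, and encoding a new word simply replays them. The sketch below re-learns the merges for the same corpus and applies them to the unseen word "lowest" — `learn_merges` and `bpe_encode` are illustrative names for this simplified scheme, not a standard API:

```python
import re

def learn_merges(corpus, num_merges):
    """Same merge-learning loop as simple_bpe, but returning the
    ordered list of merged pairs instead of the final vocabulary."""
    vocab = {}
    for word in corpus:
        key = " ".join(list(word) + ["</w>"])
        vocab[key] = vocab.get(key, 0) + 1
    merges = []
    for _ in range(num_merges):
        pairs = {}
        for word, freq in vocab.items():
            symbols = word.split()
            for j in range(len(symbols) - 1):
                pair = (symbols[j], symbols[j + 1])
                pairs[pair] = pairs.get(pair, 0) + freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

def bpe_encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        j = 0
        while j < len(symbols) - 1:
            if symbols[j] == a and symbols[j + 1] == b:
                symbols[j:j + 2] = [a + b]
            else:
                j += 1
    return symbols

merges = learn_merges(["low", "lower", "newest", "widest", "low", "low"], 5)
print(bpe_encode("lowest", merges))  # ['low', 'est', '</w>']
```

"lowest" never appeared in the corpus, yet it decomposes into the learned subwords "low" + "est" — exactly how BPE handles out-of-vocabulary words.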
## Understanding Vector Space with Word2Vec
```python
# pip install gensim
from gensim.models import Word2Vec

# Simple training data
sentences = [
    ["king", "and", "queen", "live", "in", "palace"],
    ["queen", "rules", "the", "palace"],
    ["cat", "likes", "fish"],
    ["dog", "likes", "walks"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)

# Check word vector
print(f"'king' vector dimensions: {model.wv['king'].shape}")

# Find similar words (accuracy is low due to small training data)
similar = model.wv.most_similar("king", topn=3)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```
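The scores printed by `most_similar` are cosine similarities. Here is a minimal NumPy sketch of the measure itself — the three toy 3-dimensional "embeddings" are made up by hand for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means the same
    direction, 0.0 orthogonal (unrelated), -1.0 opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors: 'king' and 'queen' point in similar
# directions, 'fish' points elsewhere.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
fish = np.array([0.1, 0.0, 0.9])

print(cosine_similarity(king, queen))  # close to 1: similar contexts
print(cosine_similarity(king, fish))   # much lower: unrelated words
```

Because the measure depends only on direction, not magnitude, words that appear in similar contexts end up with high similarity regardless of how often they occur.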
Tokenization is the gateway to LLMs, and embeddings are the foundation for LLMs to understand language. Tomorrow we’ll learn about the Transformer architecture that operates on top of these embeddings.
## Today’s Exercises
- Install tiktoken and tokenize 3 Korean sentences and 3 English sentences. Calculate how many times more tokens Korean uses compared to English.
- Experiment with increasing the number of merges in the BPE algorithm and observe how vocabulary size and token length change.
- Explain why the famous Word2Vec relationship “king - man + woman = queen” holds from the perspective of vector space.